A Note on Zipf's Law, Natural Languages, and Noncoding DNA regions
Abstract
In Phys. Rev. Letters, 73:2, 5 Dec. 94, Mantegna et al. conclude on the basis of Zipf rank-frequency data that noncoding DNA sequence regions are more like natural languages than coding regions. We argue on the contrary that an empirical fit to Zipf's "law" cannot be used as a criterion for similarity to natural languages. Although DNA is presumably an "organized system of signs" in Mandelbrot's (1961) sense, an observation of statistical features of the sort presented in the Mantegna et al. paper does not shed light on the similarity between DNA's "grammar" and natural language grammars, just as the observation of exact Zipf-like behavior cannot distinguish between the underlying processes of tossing an M-sided die and a finite-state branching process.

... to analyzing linguistic texts to the statistical study of DNA base pair sequences and find that the noncoding regions are more similar to natural languages than the coding sequences" (p. 3169). Specifically, the authors analyze coding/noncoding DNA sequences and conclude that noncoding regions show a more Zipf-like behavior than coding regions. Asserting that "A remarkable feature of languages is Zipf's law" (p. 3169), they further conclude that noncoding regions are more similar to natural languages than coding regions (p. 3170).
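The abstract's remark about die tossing can be illustrated with a small simulation. The sketch below is not taken from the paper: it assumes the classic random-typing setup in which one face of a fair M-sided die acts as a word separator and the remaining faces are letters, and the names `die_text`, `rank_frequency`, and `loglog_slope` are hypothetical helpers chosen for illustration. It only shows that a process with no grammar at all can yield a roughly linear, downward-sloping rank-frequency curve on log-log axes.

```python
import math
import random
from collections import Counter


def die_text(num_faces, num_tosses, seed=0):
    """Toss a fair die `num_tosses` times; one face is a space, the rest letters.

    Assumption (not from the paper): the separator face is what turns a
    stream of tosses into "words".
    """
    if not 2 <= num_faces <= 27:
        raise ValueError("num_faces must be between 2 and 27")
    rng = random.Random(seed)
    symbols = " " + "abcdefghijklmnopqrstuvwxyz"[: num_faces - 1]
    return "".join(rng.choice(symbols) for _ in range(num_tosses))


def rank_frequency(text):
    """Return word frequencies sorted from most to least frequent."""
    counts = Counter(text.split())
    return sorted(counts.values(), reverse=True)


def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


if __name__ == "__main__":
    # A 5-sided die: four letters plus the separator face.
    freqs = rank_frequency(die_text(num_faces=5, num_tosses=200_000))
    print("distinct words:", len(freqs))
    print("log-log slope:", round(loglog_slope(freqs), 3))
```

A negative slope with an approximately straight log-log plot here says nothing about linguistic structure, since the generating process is a die; this is the sense in which a Zipf-like fit alone would not separate such a process from a more structured one.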
Similar resources
No Signs of Hidden Language in Noncoding DNA
Recent comparisons between the statistical properties of coding and noncoding DNA sequences have been interpreted as indicating a yet-undiscovered language in noncoding DNA [1]. We argue that greater variance among nucleotide frequencies in noncoding regions explains most of the observations, which undercuts the claims in [1]. DNA sequences are long strings composed of four nucleotides (A,C,G, an...
Lack of biological significance in the 'linguistic features' of noncoding DNA--a quantitative analysis.
Recently, the application of two statistical methods (related to Zipf's distribution and Shannon's redundancy), called 'linguistic' tests, to the primary structure of DNA sequences of living organisms has excited considerable interest. Of particular importance is the claim that noncoding DNA sequences in eukaryotes display specific 'linguistic' features, being reminiscent of natural languages. ...
Zipf’s Law for Word Frequencies: Word Forms versus Lemmas in Long Texts
Zipf's law is a fundamental paradigm in the statistics of written and spoken natural language as well as in other communication systems. We raise the question of the elementary units for which Zipf's law should hold in the most natural way, studying its validity for plain word forms and for the corresponding lemma forms. We analyze several long literary texts comprising four languages, with dif...
Investigating Esperanto's Statistical Proportions Relative to other Languages using Neural Networks and Zipf's Law
Esperanto is a constructed natural language, which was intended to be an easy-to-learn lingua franca. Zipf's law models the statistical proportions of various phenomena in human ecology, including natural languages. Given Esperanto’s artificial origins, one wonders how “natural” it appears, relative to other natural languages, in the context of Zipf’s law. To explore this question, we collected...
Zipf's law against the text size: a half-rational model
In this article, we consider Zipf-Mandelbrot law as applied to texts in natural languages. We present a simple model of dependence of the law on the text size, which is featured by variable power-law tail and constant ratio of the most frequent words. As a result we derive several closed formulas, which accord with empirical data qualitatively and partially quantitatively. For example, there ap...
Journal title:
- CoRR
Volume cmp-lg/9503012
Publication date: 1995